Imputing missing genotypes: effects of methods and patterns of missing data
Abstract
The cost of high-throughput genotyping has fallen to the point where using molecular genetic marker information in applied breeding programs appears economically feasible. Practical questions remain about how best to handle missing data in the resulting genotype datasets so that their impact on the accuracy of breeding value prediction is minimized. Data can be missing for two reasons: first, genotyping assays are likely to fail for at least some loci in some samples; second, it may prove economically desirable to invest more resources in high-density genotyping of a few individuals and fewer in lower-density genotyping of many individuals [1]. The proportion of missing genotypes may range from less than 1% due to assay failure to over 80% under a selective genotyping strategy. Many methods for predicting the genetic merit of trees from marker genotype data require complete genotype information for mathematical reasons, so efficient statistical methods are needed to impute missing genotypes accurately. In species with a complete reference genome sequence, the map order of markers and linkage disequilibrium (LD) information can guide imputation of missing genotypes. Completely sequenced reference genomes are available for only two forest tree species, so these methods are not suitable for most forest trees.
Gengler et al. [2] described a method to impute missing genotypes using mixed linear models and best linear unbiased prediction (BLUP). We determined how imputation at different levels of missing genotypes (10%, 20%, 40%, 60% and 80%) affects the accuracy of BLUP estimated breeding values. Analyses were conducted with both empirical data (3461 SNP markers in a cloned loblolly pine population of 165 genotypes) and simulated data, with missing data created either by random sampling (some loci missing in all individuals) or by structured sampling (all loci missing in some individuals).
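The two missing-data patterns can be illustrated with a minimal sketch. It assumes that "random" means each genotype call is dropped independently with a given probability and that "structured" means a randomly chosen fraction of individuals loses all loci; the abstract does not spell out the exact sampling procedure, so both readings are assumptions. The function names and the placeholder genotype matrix are hypothetical and are not the study data.

```python
import numpy as np

def random_mask(n_ind, n_loci, prop, rng):
    """Assumed 'random' pattern: each call is dropped independently with probability prop."""
    return rng.random((n_ind, n_loci)) < prop

def structured_mask(n_ind, n_loci, prop, rng):
    """Assumed 'structured' pattern: a fraction prop of individuals loses every locus."""
    mask = np.zeros((n_ind, n_loci), dtype=bool)
    n_drop = int(round(prop * n_ind))
    dropped = rng.choice(n_ind, size=n_drop, replace=False)
    mask[dropped, :] = True
    return mask

rng = np.random.default_rng(0)
# Placeholder genotypes coded 0/1/2 with the dimensions quoted in the abstract.
geno = rng.integers(0, 3, size=(165, 3461)).astype(float)
for prop in (0.1, 0.2, 0.4, 0.6, 0.8):
    masked_random = np.where(random_mask(*geno.shape, prop, rng), np.nan, geno)
    masked_structured = np.where(structured_mask(*geno.shape, prop, rng), np.nan, geno)
```

Both masks remove roughly the same proportion of calls, so differences in downstream imputation accuracy can be attributed to the pattern rather than the amount of missing data.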
Simulations were used to examine how family and progeny size, mating design, proportion of missing genotypes, genotyping strategy and imputation method affect the accuracy of breeding values. Imputed genotypes were obtained using the numerator relationship matrix (the A matrix) and solving the mixed model equations for y = Xb + Mu + e, where y is the vector of gene content values, X is the design matrix (a vector of 1s) for the mean b, M is the design matrix connecting trees to the gene content vector y, u is the vector of individual tree effects and e is the vector of residual errors. The solutions of the mixed model equations give predicted SNP genotypes for trees with missing genotypes. The solutions are continuous and centered on 1, because gene content takes the values 0, 1 or 2.
In empirical data from an unbalanced mating design with family sizes ranging from 1 to 35, imputation was more accurate for structured than for random missing genotypes at every level of missing data. Imputation accuracy fell from 0.96 at 10% missing genotypes to 0.23 at 80% for structured missing genotypes, and from 0.96 to 0.16 for random missing genotypes. As the proportion of missing genotypes increased, the accuracy of imputation decreased. In simulation, imputation was less affected by the distribution of missing genotypes under a balanced mating design with families of equal size: accuracy ranged from 0.97 at 10% missing genotypes to 0.75 at 80%.
* Correspondence: [email protected]
Cooperative Tree Improvement Program, Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC, USA
Ogut et al. BMC Proceedings 2011, 5(Suppl 7):P61, http://www.biomedcentral.com/1753-6561/5/S7/P61
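The mixed-model imputation described in the abstract, y = Xb + Mu + e with var(u) proportional to A, can be sketched for a single SNP as follows. This is an illustration under stated assumptions and not the authors' implementation: the pedigree is a toy example, A is built with the standard tabular method, and the variance ratio lam (residual variance over tree-effect variance) is an arbitrary small value chosen here because the abstract does not report one.

```python
import numpy as np

def numerator_relationship(sire, dam):
    """Tabular method for A. Parents must precede offspring; -1 marks an unknown parent."""
    n = len(sire)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sire[i], dam[i]
        # Diagonal is 1 plus the inbreeding coefficient (half the parents' relationship).
        A[i, i] = 1.0 + (0.5 * A[s, d] if s >= 0 and d >= 0 else 0.0)
        for j in range(i):
            a = 0.0
            if s >= 0:
                a += 0.5 * A[j, s]
            if d >= 0:
                a += 0.5 * A[j, d]
            A[i, j] = A[j, i] = a
    return A

def impute_gene_content(y, A, lam=0.01):
    """Solve the mixed model equations for one SNP; NaN in y marks a missing genotype.

    lam is an assumed variance ratio (residual / tree effect), not a value from the source.
    Returns b + u for every tree, i.e. continuous predicted gene content.
    """
    n = len(y)
    obs = np.flatnonzero(~np.isnan(y))
    M = np.zeros((obs.size, n))
    M[np.arange(obs.size), obs] = 1.0      # links each observed record to its tree
    X = np.ones((obs.size, 1))             # vector of 1s for the mean
    yo = y[obs]
    lhs = np.block([[X.T @ X, X.T @ M],
                    [M.T @ X, M.T @ M + lam * np.linalg.inv(A)]])
    rhs = np.concatenate([X.T @ yo, M.T @ yo])
    sol = np.linalg.solve(lhs, rhs)
    b, u = sol[0], sol[1:]
    return b + u

# Toy pedigree: trees 0 and 1 are unrelated founders, trees 2-4 are their full sibs.
sire = [-1, -1, 0, 0, 0]
dam = [-1, -1, 1, 1, 1]
A = numerator_relationship(sire, dam)
y = np.array([2.0, 0.0, 1.0, 1.0, np.nan])  # tree 4 has no genotype at this SNP
predicted = impute_gene_content(y, A)
print(predicted)
```

In this toy case the prediction for the ungenotyped tree is driven entirely by its relatives through A, which is why the method needs a pedigree but no marker map or LD information. Predictions are continuous, so a rounding or thresholding step would be needed if discrete 0/1/2 calls are required.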
Similar resources
A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.
We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships ...
Criteria of GenCall score to edit marker data and methods to handle missing markers have an influence on accuracy of genomic predictions
The aim of this study was to investigate the effect of different strategies for handling low-quality or missing data on prediction accuracy for direct genomic values of protein yield, mastitis and fertility using a Bayesian variable model and a GBLUP model in the Danish Jersey population. The data contained 1 071 Jersey bulls that were genotyped with the Illumina Bovine 50K chip. After prelimina...
Missing data imputation in multivariable time series data
Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Multivariate time series missing data imputation is a challenging topic and needs to be carefully considered before learning or predicting time series. Frequent researches have been done on the use of diffe...
Data-driven methods for imputing national-level incidence in global burden of disease studies
OBJECTIVE To develop transparent and reproducible methods for imputing missing data on disease incidence at national-level for the year 2005. METHODS We compared several models for imputing missing country-level incidence rates for two foodborne diseases - congenital toxoplasmosis and aflatoxin-related hepatocellular carcinoma. Missing values were assumed to be missing at random. Predictor va...
Performance evaluation of different estimation methods for missing rainfall data
There are numerous methods to estimate missing values, some of which are used depending on the data type and regional climatic characteristics. In this research, part of the monthly precipitation data from the Sarab synoptic station, East Azerbaijan province, Iran, was randomly treated as missing. In order to study the effectiveness of various methods to estimate missing data, by seven classic s...
Journal title: BMC Proceedings
Volume 5, Suppl 7
Publication date: 2011